The adequacy of several approaches to estimation of 
proficiency distributions for the Trial State Assessment (TSA) in 
eighth grade mathematics of the National Assessment of Educational 
Progress was examined. These approaches are more restrictive than the 
estimation procedures originally used, with the same kind of 
plausible-values approach that has been implemented since 1984 with 
the national assessment. The proposed approaches are computationally 
less burdensome and could provide improved performance from a 
statistical standpoint. Results from six procedures, including the 
operational procedure for 1990, are compared to results obtained by a 
criterion procedure for eight TSA jurisdictions. With some 
modifications, alternatives were generated by crossing the method of 
obtaining principal components, the method of selecting principal 
components, and the procedure for estimating the model. Methods that 
estimate a single population model did not provide acceptable 
results. Results support use of the less restricted models, 
particularly the one used operationally in 1990, producing separate 
sets of principal components for each jurisdiction from each 
jurisdiction's within-state correlation matrix, with decisions about 
the components to include made separately for each jurisdiction on 
the basis of the percent-of-trace criterion applied to the 
jurisdiction's correlation matrix. A separate population model is 
then estimated for each jurisdiction. Fourteen tables, 2 appendices 
presenting data from the analyses, and a 20-item list of references 
are included. (SLD) 
THE USE OF COLLATERAL INFORMATION IN PROFICIENCY ESTIMATION 

FOR THE TRIAL STATE ASSESSMENT 

John Mazzeo 
Eugene Johnson 
Drew Bowker 
Y. Fai Pong 

Educational Testing Service 
INTRODUCTION 

Beginning in 1984, most of the results for the National Assessment of Educational Progress 
(NAEP) have been reported in terms of IRT-based scales (referred to here as proficiency scales). 
Estimates of features of the distribution of proficiencies (such as means, percentile locations, and 
proportions of students above certain levels) are routinely reported for important demographic 
subgroups, as well as for groups of students defined by their standing on a host of educationally 
relevant variables. 

In NAEP, the estimation of these proficiency distributions is based on a particular set of 
marginal estimation procedures sometimes referred to as the plausible values approach (Mislevy, 
Beaton, Kaplan, & Sheehan, 1992). As discussed in Mislevy (1991), the computing approximation 
used to carry out this approach is based on an extension of Rubin's (1987) multiple imputation 
procedures for survey nonresponse. Random draws, called plausible values in NAEP, are taken 
from predictive distributions, the parameters of which depend on students' responses to cognitive 
assessment items, other kinds of survey questions, and demographic variables. Following Rubin, 
multiple draws (five of them) are taken from the predictive distribution of each sampled examinee. 
When analyzed appropriately, these plausible values provide estimates of various features of the 
distribution of proficiencies, as well as appropriate estimates of the relationships between 
proficiencies and other variables, that are consistent under the model used to construct them. 

The model-based predictive distributions that form the basis of the approach consist of 
two components, a latent variable model and a population model (Mislevy, 1991). IRT provides the 
latent variable models used in NAEP. The population model, described further below, is similar 
in form to a multivariate regression model, with IRT proficiencies being the dependent variable and 
demographic variables, survey variables, and responses to other background questions forming the 
predictor variables. The parameters of both the latent variable and population models are 
estimated from the NAEP assessment data using marginal estimation procedures (see, e.g., Mislevy, 
Johnson, & Muraki, 1992). 

In 1990, NAEP carried out its first Trial State Assessment (TSA) in eighth-grade 
mathematics. As with the national assessment, results were analyzed and reported using IRT-based 
scales for each of the 40 jurisdictions that agreed to participate. Separate scales were produced for 
five mathematics content areas. Results for each participant were estimated using the same kind 
of plausible-values based approach that has been implemented since 1984 with the national 
assessment. As applied to the TSA, the estimation procedures involved estimating a single latent 
variable model (the three-parameter logistic (3PL)), but separate population models (hence, 
separate sets of predictive distributions) for each of the participating jurisdictions (Mazzeo, 1991). 
The models were similar in form (i.e., linear multivariable regression models); however, the 
parameters of the model (i.e., the regression coefficients and residual variance matrices) were free 
to vary across jurisdictions. 

The rationale for such an approach, discussed further below, was to ensure consistent (i.e., 
asymptotically unbiased) estimation of results for important subgroups within each of the 
jurisdictions. However, the approach was labor intensive, computationally intensive, and time 
consuming. With the planned expansion of the TSA to more academic subjects and more grades, 
along with desires for more timely reporting, a simpler, less labor intensive alternative producing 
comparable results would clearly be attractive. In addition, as discussed further below, simpler 
alternative procedures could, in principle, have more desirable statistical properties than the 
procedure actually used. 

The research described here examines the adequacy of several more restrictive approaches 
within a plausible values framework. Because they are more restrictive, these approaches are 
computationally less burdensome and could provide improved performance from a statistical 
standpoint. Each of the procedures was used to reanalyze the results of the 1990 TSA for eight of 
the participating jurisdictions. Results obtained from each of the procedures and the procedure 
used operationally in 1990 are compared to results obtained by a criterion procedure (described 
below). 

Overview of plausible values approach 

Since its inception in the late 1960's, NAEP has used matrix sampling methods in an attempt 
to provide broad content coverage within the subject matter domains while maintaining acceptable 
limits on examinee testing time. Beginning in 1984, a particular variant of matrix sampling, 
balanced-incomplete block spiraling, was coupled with Item Response Theory (IRT) scaling methods 
to provide a new approach to the analysis and reporting of NAEP results (Messick, Beaton, & Lord, 
1983). Since that time, NAEP results have, for the most part, been reported on scales derived from 
IRT. 

IRT was developed in the context of measuring individuals, where each examinee is 
administered enough items to estimate his or her proficiency ($\theta$) with a large degree of precision. 
In such cases, quantities that are of interest to consumers of NAEP reports (such as features of the 
distribution of $\theta$ for various demographic subgroups) are reasonably approximated by distributions 
of point estimates of $\theta$ (such as maximum-likelihood estimates). However, this approach breaks 
down in the assessment setting where, due to limited testing time, each individual is administered 
relatively few items in a scaling area. The uncertainty associated with point estimates of $\theta$ is too 
large to ignore, and the features of the distribution of these estimates can be seriously biased as 
estimates of the distribution of $\theta$ (see, e.g., Mislevy, Beaton, Kaplan, & Sheehan, 1992). 

In order to circumvent such difficulties, most NAEP results (such as proficiency means for 
subpopulations, the proportion of examinees above particular proficiency levels, and the 
relationships between proficiencies and educational variables) have been obtained using marginal 
estimation procedures which do not require point estimates for individual examinees. In NAEP, 
these marginal estimation procedures have been approximated using the so-called "plausible values" 
technology (Mislevy, 1991; Mislevy, Johnson, & Muraki, 1992; Mislevy, Beaton, Kaplan, & Sheehan, 
1992). The following is a brief overview of the plausible values approach as applied in the context 
of NAEP. A more thorough treatment can be found in Mislevy (1991). 

In NAEP, samples of examinees are administered assessment instruments and background 
questionnaires, and additional information is obtained from the teachers of the sampled examinees 
as well as the principals of their schools. Thus, for each examinee, the observed data available are 
responses to a subset of cognitive items included in the assessment, as well as a variety of 
background questions relating to demographic characteristics and other educationally relevant 
variables. Let $y$ be a vector containing the responses of all assessed examinees to all background 
variables, attitude questions, and survey design variables (such as school membership, or type of 
community). In other words, $y = (y_1, y_2, \ldots, y_n)'$, where $y_i$ indicates the vector of background 
variables for the ith examinee. Similarly, let $x = (x_1, x_2, \ldots, x_n)'$ refer to the vector of responses 
of these students to the cognitive items included in the assessment. In addition, let 
$\theta = (\theta_1, \theta_2, \ldots, \theta_n)'$ represent the vector of examinee proficiencies on the (possibly 
multiple) IRT scales of interest. 

If the $\theta_i$ were observed for each of the sampled examinees, it would be possible to compute 
a statistic $t(\theta, y)$ (e.g., a sample mean or sample percentile point for some subpopulation) to 
estimate some corresponding population quantity T. However, $\theta$ is unobservable. Following Rubin 
(1987), $\theta$ is treated as "missing data" for all sampled examinees and $t(\theta, y)$ is approximated by its 
expected value given the observed data $(x, y)$, 



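$$t^{*}(x, y) = E[\, t(\theta, y) \mid x, y \,] = \int t(\theta, y)\, p(\theta \mid x, y)\, d\theta \qquad (1)$$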



In NAEP, a Monte Carlo approximation to (1) is obtained by taking random draws from 
a predictive distribution of proficiencies for each examinee conditioned on their item responses and 
their responses to background questions and survey variables. These predictive distributions are 
denoted here as $p(\theta \mid x_i, y_i)$. The values of these random draws are referred to as imputations in the 
sampling literature and plausible values in NAEP. 
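
The Monte Carlo step can be illustrated with a minimal sketch. It assumes the plausible values are 
already available as an array with one column per draw; the function names `mc_estimate` and 
`weighted_mean` and the toy numbers are hypothetical and are not taken from the NAEP software. 

```python
import numpy as np

def mc_estimate(plausible_values, weights, statistic):
    """Average a statistic over the M sets of plausible values.

    plausible_values: array of shape (n_examinees, M), one column per draw.
    weights: sampling weights of shape (n_examinees,).
    statistic: function (theta, weights) -> float, evaluated once per draw.
    """
    pv = np.asarray(plausible_values, dtype=float)
    w = np.asarray(weights, dtype=float)
    per_draw = np.array([statistic(pv[:, m], w) for m in range(pv.shape[1])])
    return per_draw.mean(), per_draw

def weighted_mean(theta, w):
    """Weighted mean proficiency for one set of draws."""
    return np.sum(w * theta) / np.sum(w)

# Toy data (assumed): three examinees, two plausible values each.
pv = np.array([[0.0, 0.2], [1.0, 0.8], [2.0, 2.0]])
w = np.array([1.0, 1.0, 2.0])
estimate, per_draw = mc_estimate(pv, w, weighted_mean)
```

The same pattern applies to any statistic $t(\theta, y)$ that can be computed from one complete set of draws. 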

Of course, in NAEP, as in most applications, the predictive distributions are unknown. In 
order to evaluate the integral in (1), a particular model must be assumed and estimates, denoted 
$\hat{p}(\theta \mid x_i, y_i)$, obtained. In NAEP, the predictive distributions are characterized as 

$$p(\theta \mid x_i, y_i) = k'\, P(x_i \mid \theta)\, p(\theta \mid y_i) \qquad (2)$$

In (2), $k'$ is a proportionality constant and $P(x_i \mid \theta)$, what Mislevy (1991) refers to as the latent 
variable model, is the likelihood for $\theta$ induced by the vector of responses to cognitive items under 
an IRT model with conditional independence. The remaining piece, $p(\theta \mid y_i)$, referred to by Mislevy 
(1991) as the population model, is the joint density of proficiency conditional on the observed value 
$y_i$ of background responses. This joint density is assumed multivariate normal with mean given by 
$\Gamma' y_i$ and covariance matrix $\Sigma$, where $\Gamma$ is a matrix whose columns contain the coefficients for the 
regressions of each of the elements of $\theta$ on $y$. The model for the joint density of proficiency 
conditional on the background data is known in NAEP parlance as the conditioning model. 

In NAEP, estimates of predictive distributions are obtained in two steps. First, assessment 
data are used to estimate IRT item parameters using BILOG (Mislevy and Bock, 1982). These item 
parameter estimates are then treated as known and used to fit a linear model to the assessment 
data of the form 

$$\theta_i = \Gamma' y_i + \epsilon_i \qquad (3)$$

where $\epsilon_i$ is assumed normally distributed with mean zero and dispersion matrix $\Sigma$. Maximum 
likelihood estimates of $\Gamma$ and $\Sigma$ are obtained using Sheehan's (1985) MGROUP computer program. 
The program uses a variant of the EM solution described in Mislevy (1985) in which a normal 
approximation to $P(x_i \mid \theta)$ is used (for further details see Johnson & Mislevy, 1991, or Mislevy, 
Johnson, & Muraki, 1992). Based on the MLEs for $\Gamma$ and $\Sigma$, and the normal approximation to 
$P(x_i \mid \theta)$, MGROUP then also produces an estimate of the predictive distribution for subject i, 
$\hat{p}(\theta \mid x_i, y_i)$, from which the plausible values are drawn. 
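
To make the last step concrete, the following is a minimal univariate sketch, not MGROUP itself. It 
assumes the likelihood has already been approximated by a normal density with mean `lik_mean` and 
variance `lik_var`, and that the population model supplies a normal density with mean `prior_mean` 
(the fitted regression value for the examinee) and residual variance `prior_var`. All of these names and 
the numeric inputs are hypothetical. Under those assumptions the predictive distribution is normal 
with precision-weighted mean, and plausible values are simply random draws from it. 

```python
import numpy as np

def draw_plausible_values(lik_mean, lik_var, prior_mean, prior_var,
                          n_draws=5, rng=None):
    """Draw plausible values from a normal predictive distribution.

    Combines a normal approximation to the likelihood with a normal
    population model (univariate illustration only).
    """
    rng = np.random.default_rng() if rng is None else rng
    post_precision = 1.0 / lik_var + 1.0 / prior_var
    post_var = 1.0 / post_precision
    # Precision-weighted average of the two means.
    post_mean = post_var * (lik_mean / lik_var + prior_mean / prior_var)
    draws = rng.normal(post_mean, np.sqrt(post_var), size=n_draws)
    return post_mean, post_var, draws

# Assumed toy inputs for one examinee.
post_mean, post_var, pv = draw_plausible_values(
    lik_mean=0.8, lik_var=0.5, prior_mean=0.2, prior_var=0.5,
    rng=np.random.default_rng(0))
```

With few items per examinee the likelihood variance is large, so the population model contributes 
substantially to each draw, which is why the choice of predictors matters for subgroup results. 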



The plausible values approach applied to the 1990 TSA 

As described in Mazzeo (1991), a common latent variable model for each of five content 
area scales was estimated for use with all 40 jurisdictions participating in the TSA. Procedures for 
estimating the population models were more complicated. 

Plans for reporting each jurisdiction's results required analyses examining the relationships 
between proficiencies and a large number of background variables. The background variables 
included student demographic characteristics (e.g., the race/ethnicity of the student, highest level 
of education attained by parents), student attitudes toward mathematics, student behaviors both in 
and out of school (e.g., amount of TV watched daily, amount of mathematics homework each day), 
the type of math class being taken (e.g., algebra, or general eighth-grade mathematics), the amount 
of emphasis provided by the students' teachers to various topics included in the assessment, as well 
as a variety of other aspects of the students' background and preparation, the background and 
preparation of their teachers, and the educational, social, and financial environment of the schools 
they attended. Overall, relationships between proficiency and more than 50 variables, taken directly 
or derived from the student, teacher, and school questionnaires, or provided by Westat, were to be 
estimated and reported. When expressed in terms of contrast-coded main effects and interactions, 
this resulted in a total of 167 variables (see Koffler, 1991, Appendix C for a listing). 

As described in Mislevy (1991), statistics that involve proficiencies and variables that are 
explicitly incorporated in the predictive model for $\theta$ are consistent estimates of their corresponding 
population values. Statistics involving variables not included in the model are potentially subject 
to asymptotic biases. The degree of bias to be expected is a function of the complement of test 
reliability and the extent to which an omitted variable is independent of the variables included in 
the predictive model. 

To avoid biases in reporting results and to minimize biases in secondary analyses, it was 
desirable to incorporate as many of the 167 contrasts into the predictive model as possible. The 
same set of background contrasts was included in the predictive model for all 40 Trial State 
Assessment participants. However, a decision was made to estimate population models separately 
for each of the 40 TSA participants. Estimating separate population models for each state was 
more complex than the alternative of estimating a single model for all 40 states. However, 
it was felt that the potential problems associated with the simpler approach were significant enough 
to warrant the more complicated approach. The need for separate population models for each state 
can be understood by examining the potential problems associated with estimating a single common 
model. 

Under the assumptions of the model, estimating a single model for all 40 states would 
produce consistent estimates of the means for subgroups for those contrasts that were explicitly 
included in the model. For example, since a Race/Ethnicity contrast was included for Asian 
Americans, a consistent estimate of the mean proficiency of the total group of Asian-American 
students represented by those students who participated in the Trial State Assessment could be 
obtained from the single model. But TSA results were to be reported separately for each state and 
for subgroups within the states. Given this reporting structure, the single model approach is 
problematic because it will produce consistent estimates of the mean proficiency of subgroups within 
each state only if the magnitude of the effect associated with a particular contrast is identical across 
all 40 jurisdictions. Thus, the single model approach is tantamount to assuming there are no 
state-by-contrast interactions. This assumption appeared unnecessarily restrictive. The least restrictive 
approach, the one chosen, was to estimate separate population models for each jurisdiction. 

Estimating separate models for each jurisdiction was not without problems, both practical 
and theoretical. First, a number of exact and near multicollinearities existed among the predictor 
variables within each of the jurisdictions. In a standard regression analysis (e.g., unweighted least 
squares or maximum likelihood estimation), estimation of regression coefficients in the presence 
of such multicollinearities often results in computing problems and numerical instabilities. The 
M-step of each cycle of MGROUP's EM algorithm carries out a maximum-likelihood estimation 
of $\Gamma$ based on sufficient statistics calculated in its preceding E-step. Hence, similar problems arise 
in MGROUP when one tries to obtain numerically stable estimates of $\Gamma$ and, consequently, $\Sigma$. 
Identifying these collinearities and removing them by selectively deleting variables for each 
jurisdiction would be a time-consuming task. A more time-efficient alternative was to transform 
the original predictor variables into a set of linearly independent variables by extracting principal 
components from their correlation matrix. Principal components, rather than the original variables, 
were used as the y variables in the population model. 

Estimating a regression model by maximum likelihood using a full set of principal 
components is equivalent to estimating a model in terms of the original predictors (see, e.g., Jolliffe, 
1986). To see why, let Z be an n-by-q matrix of variables to be predicted, let W be an n-by-p matrix 
of standardized predictor variables, let B be a p-by-q matrix of regression coefficients, and let E be 
a matrix of residual terms. Also let A be the matrix of orthonormal eigenvectors of W'W and let 
V = WA be the matrix of principal components. Then, because AA' = I, 

$$Z = WB + E = WAA'B + E = VG + E \qquad (4)$$

where G = A'B is the matrix of regression coefficients for the principal component scores. Even 
when multicollinearities are present, the full set of components can be used to obtain an unbiased 
estimate $\hat{B} = A\hat{G}$ that avoids computational problems. However, such an estimate may be subject 
to undesirably large degrees of sampling variability (Gunst & Mason, 1977). 
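
The equivalence in (4) can be checked numerically. The sketch below uses simulated data (the 
dimensions, the seed, and the absence of an intercept are assumptions for illustration) and ordinary 
least squares in place of MGROUP's marginal maximum likelihood. It regresses Z once on the original 
predictors W and once on the full set of components V = WA, then compares the fitted values and the 
back-transformed coefficients. 

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 200, 6, 2

# Standardized predictors W and a q-column outcome Z (simulated).
W = rng.normal(size=(n, p))
W = (W - W.mean(axis=0)) / W.std(axis=0)
Z = W @ rng.normal(size=(p, q)) + rng.normal(size=(n, q))

# Columns of A are orthonormal eigenvectors of W'W; V holds component scores.
eigvals, A = np.linalg.eigh(W.T @ W)
V = W @ A

# Least-squares fits in the two coordinate systems.
B_hat, *_ = np.linalg.lstsq(W, Z, rcond=None)
G_hat, *_ = np.linalg.lstsq(V, Z, rcond=None)

same_fit = np.allclose(W @ B_hat, V @ G_hat)   # identical fitted values
same_coef = np.allclose(B_hat, A @ G_hat)      # B_hat = A G_hat
```

Because A is orthogonal, using every component only rotates the predictor space. Any gain in 
stability has to come from omitting components, as discussed next. 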

An additional reason to worry about sampling instabilities in regression coefficient estimates 
is the large variables-to-observations ratio that would result from estimating separate predictive 
models for each jurisdiction. A typical sample size for the 1990 TSA participants was about 2,500. 
With 167 variables in the model, the ratio of variables to observations is about 1 to 15, which would 
not be considered large for a simple random sample. However, NAEP data are collected according to a 
multistage sampling design, which inflates standard errors of most statistics relative to their standard 
errors under a simple random sampling design. Thus, because of the presence of multicollinearities, 
as well as the high variable-to-observation ratio, it seemed desirable to attempt to reduce the 
dimensionality of each state's predictor variables. 

One way to obtain estimators of regression coefficients with smaller sampling variances 
under the fixed effect model is to delete principal components associated with the smallest 
eigenvalues (see, e.g., Belinfante & Coxe, 1986; Coxe, 1984; Gunst & Mason, 1977). Equivalently, 
one can set coefficients for those components to 0 in the model and obtain an estimate of G subject 
to those restrictions. If H is the matrix of regression coefficients for the principal components with 
appropriate rows set to zero, and $\hat{H}$ is its estimate, then $B^{*} = A\hat{H}$ is the corresponding restricted 
estimate for the coefficients in terms of the original variables. $B^{*}$ has smaller sampling variability 
than $\hat{B}$, but may be biased to the degree that the restrictions imposed on G are not true. Analyses 
by Kaplan and Nelson (see Mislevy, 1991) on the 1988 NAEP Reading data suggested that a 
relatively small subset of principal components would capture almost all of the variance and most 
of the complex intercorrelations among the set of original variables and would reduce most of the 
potential bias for primary and secondary analyses of NAEP results. 
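
A small sketch of the restricted estimator follows. The data are simulated with one nearly collinear 
pair of predictors, a single outcome is used for brevity, and ordinary least squares stands in for the 
marginal estimation actually used. These choices are assumptions for illustration. The coefficient of 
the smallest-eigenvalue component is set to zero and the result is mapped back to the original 
variables as $B^{*} = A\hat{H}$. 

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 300, 5

# Simulated predictors; column 4 is nearly a copy of column 0.
W = rng.normal(size=(n, p))
W[:, 4] = W[:, 0] + 0.05 * rng.normal(size=n)
W = (W - W.mean(axis=0)) / W.std(axis=0)
z = W @ np.array([1.0, 0.5, 0.0, -0.5, 0.0]) + rng.normal(size=n)

# Eigen-decomposition of W'W, sorted from largest to smallest eigenvalue.
eigvals, A = np.linalg.eigh(W.T @ W)
order = np.argsort(eigvals)[::-1]
eigvals, A = eigvals[order], A[:, order]
V = W @ A

# V'V is diagonal, so least squares is computed component by component.
G_hat = (V.T @ z) / eigvals

# Restricted estimate: zero the row for the smallest-eigenvalue component.
H_hat = G_hat.copy()
H_hat[-1] = 0.0

B_hat = A @ G_hat    # unrestricted coefficients on the original variables
B_star = A @ H_hat   # restricted coefficients on the original variables
```

Because the components are orthogonal, zeroing a row of G gives the same result as refitting the 
model with that component left out. 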

Based on the above considerations, population models for each jurisdiction were obtained 
by including only a subset of the full complement of principal components as predictor variables. 
The principal components were produced separately for each jurisdiction. In other words, we 
obtained a separate eigenvector matrix $A_s$ for each jurisdiction based on that jurisdiction's 
cross-product matrix of standardized predictors ($W_s'W_s$). The selection of the subset of principal 
components to include in the model was then done separately for each jurisdiction. In other words, 
the restrictions to be placed on G were decided on separately for each state. The number of 
principal components included in the population model for each jurisdiction was that number 
required to account for approximately 90 percent of the variance in the original contrast variables. 
Mislevy (1991) shows that this puts an upper bound of 10 percent on the potential bias for all 
analyses involving the original variables. Finally, population models were estimated separately for 
each jurisdiction based on only that jurisdiction's data. In other words, separate estimates of the 
residual variance matrix ($\hat{\Sigma}_s$) and separate restricted estimates of the model coefficients 
$\hat{\Gamma}_s = A_s\hat{G}_s$ were produced for each jurisdiction. 

Some alternatives to the approach used 

The procedure used to estimate predictive models for the 1990 TSA participants was 
developed in an attempt to: a) avoid unnecessary assumptions about the constancy of the 
relationships between background variables and proficiencies across jurisdictions, b) avoid 
computational difficulties due to multicollinearities, and c) reduce the potentially large sampling 
variability associated with using a high-dimensional set of predictor variables. The resulting procedure 
was, however, quite labor intensive and computationally burdensome. Because of the possible 
expansion of the scope of the TSA in the future, and the desire to shorten timelines for the analysis 
and reporting of results, there was considerable interest in considering less labor-intensive and less 
computationally burdensome procedures. If such procedures produced equivalent, or nearly 
equivalent, results to the one used in 1990, consideration could be given to implementing these 
procedures in 1992 or in future NAEP assessments. 

From a more statistical standpoint, there was interest in examining alternative procedures 
for reducing the dimensionality of the set of predictor variables. As mentioned above, deleting 
principal components with small eigenvalues is equivalent to using a restricted estimator for the 
coefficients of the population structure model, which is possibly biased but may have smaller 
variance than an unrestricted estimator. If the restrictions imposed are correct (i.e., if the principal 
components set to 0 are, in fact, unrelated to the variable to be predicted), such a restricted 
estimate also has smaller mean squared error than the unrestricted estimator. However, this is not 
always the case. Lott (1973), for one, has provided examples in the context of principal 
components regression in which components with small eigenvalues have substantial relationships 
with the dependent variable. Eliminating these components from the regression equation can 
produce restricted estimates with smaller variance, but their associated bias results in larger mean 
squared error (MSE). A logical alternative approach would be to delete components with the 
smallest correlations with the dependent variable of interest. Keeping components with the 
strongest correlation with the dependent variable tends to result in restricted estimators with larger 
variance, but smaller MSE. 
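
The trade-off described above rests on the standard decomposition of mean squared error, shown 
here for a single coefficient. The symbols $\beta$ (the true coefficient) and $\hat{\beta}^{*}$ (its restricted 
estimate) are introduced only for this illustration. 

```latex
\mathrm{MSE}(\hat{\beta}^{*})
  = E\big[(\hat{\beta}^{*} - \beta)^{2}\big]
  = \mathrm{Var}(\hat{\beta}^{*}) + \big[E(\hat{\beta}^{*}) - \beta\big]^{2}
```

Dropping a component always lowers the variance term, but it lowers the MSE only when the squared 
bias it introduces is smaller than the variance it removes. 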

Based on the considerations discussed above, six alternative procedures for estimating 
population models were compared. With some modifications due to practical considerations, the 
alternatives were generated by crossing three factors: 1) method of obtaining principal components, 
2) method of selecting principal components, and 3) procedure for estimating the model. 

Factors 

Method of obtaining principal components (across-state vs within-state) 

Two different methods of obtaining linear combinations of the original variables were 
compared. The first method, across-state principal components, consisted of obtaining the principal 
components of an aggregate correlation matrix of predictor variables. This aggregate matrix was 
based on all available data from each of the 40 jurisdictions that participated in the 1990 TSA. The 
total sample size was about 100,800, with the sample sizes for the jurisdictions ranging from 1,326 
to 2,843 and a typical sample size of about 2,500. The aggregate correlation matrix was produced 
using a rescaled version of the sampling weights used for reporting each jurisdiction's 1990 results. 
The rescaling was carried out so that the sums of the weights for all jurisdictions were equal. The 
intent of the rescaling was to produce a correlation matrix pertinent to a synthetic population 
containing approximately equal numbers of students from each jurisdiction. 
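
The rescaling and the weighted aggregate correlation matrix can be sketched as follows. The function 
names, the common weight total of 1.0 per jurisdiction, and the simulated inputs are assumptions for 
illustration. The operational weights and software are not reproduced here. 

```python
import numpy as np

def rescale_weights(weights, jurisdiction):
    """Rescale weights so every jurisdiction's weights sum to the same total (1.0)."""
    weights = np.asarray(weights, dtype=float)
    jurisdiction = np.asarray(jurisdiction)
    out = np.empty_like(weights)
    for j in np.unique(jurisdiction):
        mask = jurisdiction == j
        out[mask] = weights[mask] / weights[mask].sum()
    return out

def weighted_corr(X, w):
    """Weighted correlation matrix of the columns of X."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    centered = X - w @ X                      # subtract weighted column means
    cov = (centered * w[:, None]).T @ centered
    sd = np.sqrt(np.diag(cov))
    return cov / np.outer(sd, sd)

# Simulated example: two jurisdictions of unequal size.
rng = np.random.default_rng(0)
jur = np.repeat(np.array(["A", "B"]), [30, 60])
raw_w = rng.uniform(1.0, 3.0, size=90)
X = rng.normal(size=(90, 3))
w = rescale_weights(raw_w, jur)
R = weighted_corr(X, w)
```

After rescaling, the larger jurisdiction no longer dominates the aggregate matrix simply because it 
contributed more sampled students. 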

The second method, within-state principal components, consisted of obtaining separate sets 
of principal components for each jurisdiction from their within-state correlation matrix. As with the 
previous method, sampling weights were used in generating the matrix. This latter method is the one 
that was used operationally for the 1990 TSA. 

In the notation of the previous section, the across-state principal components involved 
obtaining a single set of eigenvectors A for use with all states. The values of variables in the 
population model for state s were $V_{as} = W_{as}A$, where $W_{as}$ is a matrix of predictor variables 
standardized in terms of the aggregate means and variances of the predictor variables. For the 
within-state component method, the values of the variables for the population model for state s 
were $V_{ws} = W_{ws}A_s$, where $W_{ws}$ is a matrix of predictor variables standardized in terms of the 
within-state means and variances of the predictor variables. 

Selecting principal components (% of trace vs r-squared criterion) 

Two approaches to selecting the subset of principal components to include in the model 
were used. The first approach, the approach used operationally for the 1990 TSA, was based on 
a "percent-of-trace" criterion. The principal components were ranked (from high to low) in terms 
of their associated eigenvalues and the first s components were selected such that the sum of the 
eigenvalues for these s components was greater than or equal to 90 percent of the trace of the 
correlation matrix. Applying this criterion to the within-state components produces a separate set 
of restrictions on G for each state. Applying the criterion to the across-state components results in 
a common set of restrictions on G for each state. 
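
The percent-of-trace rule can be written in a few lines. This is a sketch only: the function name 
`select_by_percent_of_trace` and the small equicorrelation matrix used to exercise it are 
hypothetical. 

```python
import numpy as np

def select_by_percent_of_trace(corr, threshold=0.90):
    """Keep the leading components whose eigenvalues reach `threshold` of the trace.

    Returns the number kept, the retained eigenvectors (as columns),
    and the cumulative share of the trace for every component.
    """
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]            # rank from high to low
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cum_share = np.cumsum(eigvals) / np.trace(corr)
    # Smallest s whose cumulative share is >= threshold.
    n_keep = min(int(np.searchsorted(cum_share, threshold)) + 1, len(eigvals))
    return n_keep, eigvecs[:, :n_keep], cum_share

# Toy 4-by-4 correlation matrix with all off-diagonal entries equal to 0.5.
corr = np.full((4, 4), 0.5)
np.fill_diagonal(corr, 1.0)
n_keep, A_keep, cum_share = select_by_percent_of_trace(corr, 0.90)
```

Applied to each jurisdiction's own correlation matrix, the rule can retain a different number of 
components in each state. Applied to the aggregate matrix, it yields one common set. 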

The second approach consisted of two slightly different variations, each of which used an 
"r-square" criterion. The first variation, applied to the within-state principal components, is based 
on the procedure suggested by Lott (1973) in the context of principal components regression. Lott's 
procedure is to include in the model that set of components which maximizes the adjusted squared 
multiple correlation (Kerlinger & Pedhazur, 1973, page 283) with the dependent measure. Because 
principal components are uncorrelated by construction, the squared multiple correlation (and hence, 
the adjusted squared multiple correlation) between the dependent variable and any set of these 
principal components is determined by the sum of the squared zero-order correlations of the 
components in that set. This feature makes Lott's procedure particularly simple to implement. The 
principal components are sorted and ranked (from high to low) on the basis of their zero-order 
correlation with the dependent variable. One then determines the rank order of the principal 
component at which the adjusted multiple correlation is maximized and includes in the conditioning 
model that component and all components with lower rank orders (i.e., stronger correlations). 
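
A minimal sketch of this selection rule is given below. It assumes the usual adjustment 
1 - (1 - R^2)(n - 1)/(n - k - 1) for k retained components, unweighted data, and centered, mutually 
uncorrelated component scores. The function name and the simulated data are hypothetical. 

```python
import numpy as np

def select_by_adjusted_r2(V, z):
    """Pick components by maximizing adjusted R^2 with the dependent variable.

    V: (n, p) matrix of centered, mutually uncorrelated component scores.
    z: (n,) dependent variable.
    Returns the indices of the retained components and the adjusted R^2 path.
    """
    n, p = V.shape
    r = np.array([np.corrcoef(V[:, j], z)[0, 1] for j in range(p)])
    order = np.argsort(r ** 2)[::-1]       # strongest correlation first
    r2_cum = np.cumsum(r[order] ** 2)      # R^2 adds up for uncorrelated components
    k = np.arange(1, p + 1)
    adj = 1.0 - (1.0 - r2_cum) * (n - 1) / (n - k - 1)
    n_keep = int(np.argmax(adj)) + 1
    return order[:n_keep], adj

# Simulated component scores; the outcome depends mainly on component 2.
rng = np.random.default_rng(0)
W = rng.normal(size=(500, 8))
W -= W.mean(axis=0)
_, A = np.linalg.eigh(W.T @ W)
V = W @ A
z = 3.0 * V[:, 2] / V[:, 2].std() + rng.normal(size=500)
kept, adj = select_by_adjusted_r2(V, z)
```

Note that the ranking uses correlations with the dependent variable, not eigenvalues, so a component 
with a small eigenvalue can still be retained. 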

In the context of the present study, the dependent variable for the population model is $\theta$. 
Because of the multiple-imputation procedures used in NAEP, accurate determination of the 
correlations between principal component scores and $\theta$ would require that predictive models be 
already available. So for practical reasons, correlations with a surrogate measure of proficiency 
were used to rank the principal components. The measure used was a logit transformation of the 
proportion of items answered correctly by each examinee. There were seven distinct test booklets 
used for the 1990 TSA. Each booklet contains items from each of the five content areas for which 
TSA scales were produced, and the spiral administration procedure used in NAEP results in each 
booklet being administered to randomly equivalent samples within each of the participating 
jurisdictions. However, the booklets are not designed to be parallel in terms of level of difficulty. 
Therefore, the logit scores from each booklet were further standardized to have a mean of 0 and 
standard deviation of 1 in the aggregate sample of examinees that participated in the 1990 
assessment. 
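
The surrogate measure can be sketched as follows. The continuity correction that keeps the logit 
finite for perfect or zero scores, the unweighted standardization, and the function name are 
assumptions made for this illustration. The source does not say how those details were handled. 

```python
import numpy as np

def surrogate_scores(n_correct, n_items, booklet):
    """Logit of proportion correct, standardized within each booklet."""
    n_correct = np.asarray(n_correct, dtype=float)
    n_items = np.asarray(n_items, dtype=float)
    booklet = np.asarray(booklet)
    # Assumed continuity correction so that 0% and 100% correct stay finite.
    prop = (n_correct + 0.5) / (n_items + 1.0)
    logit = np.log(prop / (1.0 - prop))
    out = np.empty_like(logit)
    for b in np.unique(booklet):
        mask = booklet == b
        out[mask] = (logit[mask] - logit[mask].mean()) / logit[mask].std()
    return out

# Toy data: two booklets of different length and difficulty.
n_correct = np.array([5, 10, 15, 8, 12, 20])
n_items = np.array([20, 20, 20, 25, 25, 25])
booklet = np.array([1, 1, 1, 2, 2, 2])
scores = surrogate_scores(n_correct, n_items, booklet)
```

Standardizing within booklet removes booklet difficulty from the surrogate, so that correlations with 
the principal components reflect examinee differences and not which booklet was assigned. 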

Applying Lott's adjusted r-square procedure to the across-state principal components 
resulted in some problems. Because of the extremely large sample size on which the across-state 
approach is based (100,800 cases), the adjusted r-square criterion suggested that virtually all 
principal components should be included. As discussed earlier, such an outcome is undesirable for 
practical reasons. Therefore, an alternative r-square approach was needed for selecting the 
appropriate subset of across-state principal components. An examination of multiple correlations 
revealed that keeping the 83 components with the largest zero-order correlations with logit scores 
resulted in a multiple r-square that was 99 percent as large as the multiple r-square obtainable using 
all principal components. Therefore, this set of 83 components was used for the across-state 
principal component methods. 

As with the percent-of-trace methods, applying the r-square based criteria to the within-state 
components produces a separate set of restrictions on G for each state, and applying the criteria 
to the across-state components results in a common set of restrictions on G for each state. 

Type of population model (within-state vs across-state) 

Two approaches were evaluated. The first approach, equivalent to what was done 
operationally for the 1990 TSA, was to estimate separate population models for each state. In other 
words, separate estimates $\hat{\Gamma}_s$ and $\hat{\Sigma}_s$ were obtained for each of the jurisdictions. The second 
approach involved estimating a single population model for use in all the states. In other words, 
a single set of estimates, $\hat{\Gamma}$ and $\hat{\Sigma}$, was produced. 

The six alternative procedures 

A total of six alternative procedures were obtained by combining the various levels of the 
factors described above. The six procedures were as follows: 
1 - within-state pc's/within-state models/select by % of trace

This procedure (denoted as wwt) was the procedure used operationally for the 1990 TSA.
It consisted of producing separate sets of principal components from each jurisdiction's within-state
correlation matrix. Decisions about which components to include in the model were made
separately for each jurisdiction on the basis of the percent-of-trace criterion applied to each
jurisdiction's correlation matrix. A separate population model was then estimated for each
jurisdiction.
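
The selection step of this procedure can be sketched in a few lines. The sketch assumes NumPy; the function name `select_by_percent_of_trace` is hypothetical, and the 90 percent threshold in the example is an illustrative value only, not a figure taken from this section.

```python
import numpy as np

def select_by_percent_of_trace(corr, percent):
    """Return (eigenvalues, eigenvectors) for the leading principal components
    of a correlation matrix that together account for at least `percent`
    of its trace."""
    eigvals, eigvecs = np.linalg.eigh(corr)              # ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # largest first
    share = np.cumsum(eigvals) / eigvals.sum()           # cumulative share of trace
    k = int(np.searchsorted(share, percent / 100.0)) + 1
    k = min(k, len(eigvals))
    return eigvals[:k], eigvecs[:, :k]

# Toy within-jurisdiction correlation matrix (made up for illustration).
corr = np.array([[1.0, 0.8, 0.0],
                 [0.8, 1.0, 0.0],
                 [0.0, 0.0, 1.0]])
kept_vals, kept_vecs = select_by_percent_of_trace(corr, percent=90)
```

Under the wwt procedure a call of this kind would be made once per jurisdiction, each with that jurisdiction's own correlation matrix, so the retained set can differ from one jurisdiction to the next.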

2 - across-state pc's/within-state models/select by % of trace

This procedure (denoted as awt) used a single set of principal components for each of the
jurisdictions, derived from the aggregate correlation matrix. The set to be included was determined
by applying the percent-of-trace criterion to the aggregate correlation matrix. A separate population
model was then estimated for each jurisdiction. Note that, unlike the wwt procedure, the same set
of principal components was included in each jurisdiction's model or, equivalently, the same set of
restrictions on the regression coefficients was imposed. However, separate estimates of these
restricted model coefficients were obtained for each jurisdiction.

3 - across-state pc's/across-state model/select by % of trace

This procedure (denoted as aat) used the same set of across-state principal components as
used for the awt procedure. However, unlike the previous two methods, a single model was
estimated and used for each of the jurisdictions. In other words, the same set of restrictions was
imposed and a single set of restricted coefficients estimated.

4 - within-state pc's/within-state model/select by adjusted r-square

This procedure (denoted as wwr) is identical to procedure 1 except that decisions about
which components to include in the model were made on the basis of the adjusted r-square
criterion applied separately to each jurisdiction's correlation matrix.

5 - across-state pc's/within-state model/select by r-square

This procedure (denoted as awr) is identical to procedure 2 except that the set of pc's to be 
included was determined by applying the 99% r-square criterion to the aggregate correlation matrix. 

6 - across-state pc's/across-state model/select by r-square

Denoted as aar, this procedure used the same set of across-state principal components as
the previous method, but a single model was estimated and used for each of the jurisdictions.

Methods 1 and 4 represent the least restrictive alternatives. Because estimation is carried
out separately for each jurisdiction, the model being fit allows regression coefficients associated with
any particular variable to differ across jurisdictions. Because the principal components are
developed and selected separately for each jurisdiction, a unique set of linear constraints is being
imposed on each jurisdiction's coefficients. Methods 1 and 4 differ in that, within each jurisdiction,
they impose different sets of constraints on the parameters.

Methods 2 and 5 are slightly more restrictive. Again, the estimation of separate models for
each state allows for different regression coefficients for each state. However, the use of a single
set of principal components results in the same linear constraints being imposed on the model for
all of the states. Methods 2 and 5 differ in that, within each jurisdiction, they imply different sets
of these common constraints.

Methods 3 and 6 represent the most restrictive models. Fitting a single population model
for all participants implies an identical set of regression coefficients for all participants, and
using a single set of principal components implies the same set of restrictions. Again, methods 3
and 6 differ only with respect to the particular linear constraints being imposed on the regression
coefficients.
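
The sense in which retaining only some principal components imposes linear constraints on the regression coefficients can be illustrated numerically. The sketch below uses simulated data and NumPy, with hypothetical variable names, and fits by ordinary least squares rather than by the marginal estimation used operationally. The coefficients implied for the original variables by a regression on the retained components are orthogonal to every dropped component direction.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(300, 4))
x = x - x.mean(axis=0)
y = x @ np.array([1.0, -0.5, 0.25, 0.0]) + rng.normal(size=300)

# Principal directions of the predictors, largest variance first.
_, _, vt = np.linalg.svd(x, full_matrices=False)
v_kept, v_dropped = vt[:2].T, vt[2:].T

# Regress y on the scores of the two retained components only.
gamma, *_ = np.linalg.lstsq(x @ v_kept, y - y.mean(), rcond=None)

# Implied coefficients on the original variables.
beta_restricted = v_kept @ gamma

# Each dropped direction defines one linear constraint; all evaluate to ~0.
constraint_values = v_dropped.T @ beta_restricted
```

Choosing a different subset of components changes `v_dropped` and therefore the particular constraints, which is how the six procedures differ from one another.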

METHOD 

Data

The data used in the study were from the 1990 Trial State Assessment. In order to keep
work loads and costs at acceptable levels, a subset of eight of the 40 1990 participants was selected
for the study. The jurisdictions included in the study evidenced a diversity of demographic profiles
and exhibited a fairly wide range of performance on the assessment. The jurisdictions studied were
California, the District of Columbia, Florida, Hawaii, New Jersey, North Dakota, Texas, and the
Virgin Islands.

Procedure 

Data from the 1990 TSA for each of the eight jurisdictions were reanalyzed using each of
the procedures described above. Estimation was carried out using MGROUP. In estimating the
models, the operational 1990 IRT item parameters were used and results were obtained for all five
content area scales. Since one of the alternatives (wwt) was the actual procedure used operationally
for the analysis and reporting of the 1990 TSA results, no additional analysis was required for this
alternative. The remaining five alternatives required a partial reanalysis of the 1990 data. For the
methods which estimated a single population model for all jurisdictions (aat and aar), principal
components were determined and the population model was estimated using all available data from
the 1990 TSA (i.e., using data from all 40 participants). Data were weighted so that each
jurisdiction contributed equally to the resulting estimates. The same principal components used for
the single population methods were used with the awt and awr methods.
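
One simple way to make each jurisdiction contribute equally is to rescale the examinee weights so that they sum to the same total within every jurisdiction. The text does not say how the weighting was actually carried out, so the sketch below is an assumption for illustration only; the function name and the example weights are hypothetical.

```python
import numpy as np

def equalize_jurisdiction_weights(weights, jurisdiction):
    """Rescale weights so that every jurisdiction's weights sum to 1."""
    weights = np.asarray(weights, dtype=float)
    jurisdiction = np.asarray(jurisdiction)
    rescaled = np.empty_like(weights)
    for j in np.unique(jurisdiction):
        mask = jurisdiction == j
        rescaled[mask] = weights[mask] / weights[mask].sum()
    return rescaled

# Made-up sampling weights for five examinees in two jurisdictions.
w = equalize_jurisdiction_weights(
    [10.0, 30.0, 5.0, 5.0, 10.0],
    ["CA", "CA", "DC", "DC", "DC"],
)
```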

The major purpose of using the plausible values approach in NAEP is to ensure that
consistent estimates of important subgroup differences, such as male/female and white/black
differences, can be obtained. Therefore, the procedures were compared by focusing primarily on
the magnitude of estimated subgroup differences on the NAEP proficiency scales and estimates of
the within-subgroup variability in proficiency. Comparisons were confined to two of the five content
area scales, Numbers and Operations (NO) and Data Analysis, Statistics, and Probability (DASP).
The NO scale is the longest of the 1990 grade 8 scales, with a typical examinee being
administered about 20 items. The DASP scale is the shortest of the 1990 grade 8 scales, with a
typical examinee being administered 8 items. The importance of the conditioning model used
depends on the amount of information available about individual proficiencies (Mislevy, Johnson,
& Muraki, 1992). Consequently, the greatest differences between the methods are to be expected
for the DASP scale and the least for the NO scale. An additional legitimate criterion for comparing the
procedures involves comparing them in light of estimates of the variability (due to sampling and
other sources) associated with each.

Each of the above alternatives was evaluated by comparing its results to those obtained using
a close approximation to a full maximum-likelihood solution under the least restrictive model. A
full maximum-likelihood solution under the least restrictive model would entail separate coefficients
for each jurisdiction for each of the predictor variables included in the population model. Such a
model would result in unbiased estimates of the coefficients, although the error variances for the
model coefficients could be somewhat larger than those obtained under a more restricted model.
Within a principal components framework, such a model would be estimated by using separate sets
of components for each jurisdiction, including all components with non-zero eigenvalues, and fitting
the model separately within each jurisdiction. However, the inclusion of principal components
associated with near-zero eigenvalues often results in computational instabilities.

The criterion procedure actually used was an approximation to the full maximum-likelihood
solution which avoided the computational instabilities. The procedure involved obtaining separate
sets of principal components for each jurisdiction and deleting principal components which showed
evidence of computational instabilities. Such instabilities arise for the components with extremely
small eigenvalues and are evident when one examines correlations among such components or between
these components and other variables. Across the jurisdictions that we studied, such instabilities
could be effectively eliminated by deleting the set of principal components with the smallest
eigenvalues such that the sum of the eigenvalues in the set was about 1 percent of the trace of the
matrix of all eigenvalues. The remaining principal components were then used to estimate a
separate population model for each jurisdiction.
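
The deletion rule used for the criterion procedure can be sketched as follows. The text says only that the deleted eigenvalues summed to "about 1 percent" of the trace; the sketch makes the assumption, for concreteness, that components are dropped from the smallest eigenvalue upward for as long as their cumulative share stays at or below the threshold. The function name and the example eigenvalues are hypothetical.

```python
import numpy as np

def drop_unstable_components(eigvals, trace_share=0.01):
    """Return the indices of components kept after deleting the
    smallest-eigenvalue components whose eigenvalues together sum to
    at most `trace_share` of the trace."""
    eigvals = np.asarray(eigvals, dtype=float)
    order = np.argsort(eigvals)                       # smallest eigenvalue first
    cum_share = np.cumsum(eigvals[order]) / eigvals.sum()
    # Number of leading (smallest) components whose cumulative share
    # does not exceed the threshold.
    n_drop = int(np.searchsorted(cum_share, trace_share, side="right"))
    return np.sort(order[n_drop:])

# Made-up eigenvalues, including one exact collinearity (eigenvalue 0).
kept = drop_unstable_components([5.0, 3.0, 1.92, 0.04, 0.03, 0.01, 0.0])
```

Here the zero eigenvalue and the three smallest non-zero eigenvalues together make up 0.8 percent of the trace, so those four components are deleted and the three largest are retained.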

Appendix 1 shows the total number of original variables, the number of exact collinearities
(i.e., the number of 0 eigenvalues), and the number of principal components deleted due to
computational instabilities. Appendix 2 shows total group and subgroup sample sizes for each of
the eight jurisdictions evaluated in the study.

RESULTS 

Table 1 shows the number of principal components included in the model for the methods
that selected principal components separately for each state. The number selected (or, equivalently,
the number of restrictions imposed) by the adjusted R² criterion was less than or equal to the number
selected by the percent-of-trace criterion for all but one jurisdiction (North Dakota). For the
methods using a single set of principal components for all states, the percent-of-trace criterion
resulted in substantially more components being selected (94) than did the 99 percent of R² criterion
(71).

Table 2 provides estimates of the mean and standard deviation of the aggregate population
of students from the eight jurisdictions included in the study, as estimated by the criterion method
and each of the six alternative methods. For the NO scale, estimates of the means and standard
deviations are quite similar for all six methods. The largest difference from the criterion method
occurred in the standard deviation estimate produced by the aar method. However, even this
difference is only .4 scale points. For the DASP scale, results from the two methods using a single
across-state population model depart somewhat from the remaining methods. Both the aat and aar
estimates of the mean of the aggregate population are about 1 point higher than those obtained
with the remaining methods, while the standard deviations are almost 3 points lower. Apparently,
even for fairly coarse aggregate statistics like the overall mean and standard deviation of a
collection of states, using a single population model seems to introduce some distortion in location
and unit of scale. This demonstrates that the particular conditioning model used is more important
for cases with lower levels of information about individual proficiencies.

The means and standard deviations in Table 2 were used to linearly equate the scales
produced by each of the methods to the scale of the criterion method. This was done so that
subsequent comparisons could be evaluated in terms of changes in results over and above those
related to differences in location and unit of scale. It should be pointed out, however, that such
adjustments affect primarily the single population methods, as little distortion was present for the
remaining methods. All subsequent results are in terms of these "equated" scales.
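
Linear equating of this kind matches the mean and standard deviation of one scale to those of another. A minimal sketch follows; the function name is hypothetical and the numbers in the example are invented for illustration, not taken from Table 2.

```python
def linear_equate(score, mean_from, sd_from, mean_to, sd_to):
    """Map a score from one scale onto another so that the source scale's
    mean and standard deviation are carried onto the target scale's."""
    return mean_to + sd_to * (score - mean_from) / sd_from

# Invented values: an alternative method with mean 101 and SD 8,
# equated to a criterion scale with mean 100 and SD 10.
equated = linear_equate(109.0, mean_from=101.0, sd_from=8.0,
                        mean_to=100.0, sd_to=10.0)
```

A score one standard deviation above the mean on the alternative scale (109) lands one standard deviation above the mean on the criterion scale (110).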

Table 3 shows the means for each jurisdiction, deviated from the national mean, for the
criterion method and for each of the alternative methods. In order to provide some framework for
evaluating the size of the differences produced by the various methods, Table 4 shows the
differences between each alternative method and the criterion method divided by an estimate of
the standard error of the criterion method result. In general, we regard small differences in
standard error units as being below the threshold of noise inherent in the survey. It should be
noted that the standard errors shown reflect estimation error due to sampling examinees and error
due to using multiple-imputation estimates of proficiency (Mislevy, Johnson, & Muraki, 1992).
However, the production version of the MGROUP program available at the time of these analyses
estimates the population model and draws multiple imputations treating Γ estimates as fixed across
imputations. As a result, uncertainty associated with estimating Γ is not reflected in these standard
errors. (The program to be used for the 1992 analyses incorporates this source of variance.)

For the NO scale, differences between the results from the criterion method and all four of
the procedures that estimated separate population models for each jurisdiction were, with one
exception, small. However, the estimate of the mean for the Virgin Islands (the most extreme
jurisdiction) differed from the criterion method estimate by over two standard errors under the awt
method and close to one standard error under the awr method. It is interesting to note that both
methods slightly underestimated how far the Virgin Islands mean was from the national mean. For
the methods that estimate a single population model, results from the aar method differed little
from the criterion method. The aat method showed slightly larger differences for two of the
participants (Hawaii and D.C.), but all differences were less than a standard error.

Results are less satisfactory for the DASP scale. The means for Hawaii and D.C. differ from
the criterion method by well over a standard error for both the methods which used across-state
population models. The differences for the two jurisdictions were in opposite directions, however.
Hawaii's distance from the national mean was slightly underestimated by both methods, while D.C.'s
was slightly overestimated.

Table 5 shows estimates of standard deviations for each jurisdiction obtained under the
criterion method. Table 5 also contains, for each of the alternative methods, the ratio of its
estimated standard deviation to the criterion method estimate. Table 6 provides a listing of the
difference from the criterion method standard deviation divided by the standard error of the
criterion method standard deviation. This standard error is the square root of the sampling
variance of the criterion method standard deviation, estimated by the jackknife repeated-replications
procedure (Johnson & Rust, 1992). For both the NO and DASP scales, results from the methods
which estimate a single population model differ noticeably from the criterion method results. For
the NO scale, differences exceed two standard errors for three of the eight jurisdictions for the aat
method. Results for the aar method are only slightly better. Even more dramatic differences are
evident for the DASP scale, where results differ from the criterion method in some cases by 4 to
6 standard errors. Of the remaining methods, results appear closest to the criterion method for the
methods which selected principal components by the percent-of-trace criterion. Both the wwt and
awt methods evidenced only 1 of 16 differences larger than a standard error. The wwr and awr
methods, though better than the single population model methods, did result in several differences
exceeding a standard error for each of the scales.

As discussed earlier, an area of interest in NAEP results involves examining differences
in academic achievement among the various demographic groups, as well as among groups defined
by their standing with respect to certain policy-relevant educational variables. As shown in Mislevy,
Beaton, Kaplan, & Sheehan (1992), the predictive models used in NAEP are essential to ensure
accurate estimation of subgroup differences when examinees are administered a relatively small
number of items. Therefore, it is of particular interest to examine estimates of subgroup differences
obtained under each of the alternative methods. In addition, it is of equal interest to examine the
extent to which the number of items taken by each examinee mitigates the effects of differences in
the methods of obtaining population models. As mentioned earlier, one would expect larger
differences due to method for the DASP scale.

Table 7 contains estimates of male-female differences as estimated by each of the population
methods. Table 8 contains the differences between each alternative method and the criterion
method divided by the criterion method standard error. Differences between the alternative
methods and the criterion method are quite small for the NO scale, with one notable exception.
Under the criterion method, the gender difference in Hawaii was estimated as 8 points favoring
females, compared to 0 points for the nation as a whole. Under the aat and aar methods, Hawaii's
gender difference is shrunk to 5 and 43 points respectively, differences that are on the order of 2
standard errors. Hawaii's results are equally unsatisfactory under the across-state methods for the
DASP scale. In addition, the results for the Virgin Islands suggest that the size of the gender
difference is underestimated, relative to the criterion method, by all six alternative methods, and the
underestimation is fairly substantial for three of the methods (awt, wwr, and aar). Somewhat
surprisingly, one of the three methods showing substantial differences from the criterion method
was the wwr method, which involves estimating separate models for each state.

Table 9 contains estimates of mean proficiency differences between white and Hispanic
students and Table 10 shows the differences in terms of criterion method standard error units. For
both the NO and DASP scales, it is apparent that the methods using a single population model perform
markedly less well than the other methods. As might be expected from the earlier discussion, the
across-state methods performed particularly poorly for the DASP scale. In Florida, where the
difference is somewhat smaller than that observed nationally, and for New Jersey and Texas, where
the difference is somewhat larger than that observed nationally, white-Hispanic differences are
substantially overestimated by both the single population model approaches. By contrast, in D.C.
and Hawaii, where the differences are larger than that observed nationally, the white-Hispanic
difference is substantially underestimated.

As part of the NAEP survey, the teachers of the assessed students were asked a variety of
questions about their classes. One such question, included as a population variable, asked teachers
to indicate the ability level of their classes (high, medium, low, or mixed). Table 11 contains
estimates of mean proficiency differences between students in high and low ability classes, as
identified by their teachers, and Table 12 shows the differences in criterion method standard error
units. For Hawaii, where the difference was much larger than the difference observed nationally,
results from the across-state population models badly underestimated the size of the difference
between high and low ability classes. This underestimation occurred for the NO scale but was
particularly noticeable for the DASP scale, where the difference estimated under the single
population methods was only about half as large as that estimated under the criterion method.

As part of the NAEP survey, the assessed students were asked a variety of questions about
their backgrounds, study habits, and out-of-school activities. One such question, included as a
population variable, asked students how much television they watched each day (1 hour or less, 2
hours, 3 hours, 4-5 hours, 6 or more hours). Table 13 shows mean proficiency differences between
students watching 1 hour or less and students watching 6 or more hours, and Table 14 shows
differences from the criterion method in standard error units. With one exception, differences
between each of the methods and the criterion method are small for the NO scale. However, in the
Virgin Islands, where the difference slightly favors students watching 6 or more hours of TV, the
across-state population procedures reverse the direction of the estimated effect. This effect is even
more dramatically evident for the DASP scale. Based on the criterion procedure, the mean
proficiency for Virgin Islands students reporting 6 or more hours of TV watching is almost 11
points higher than the proficiency mean for students reporting 1 hour or less of TV watching. This
effect is markedly different from that observed for the remaining jurisdictions and for the nation
as a whole. Perhaps this is due to the demographics of the Virgin Islands. Students reporting 1 hour
or less of TV watching may come from lower SES homes where TVs may be less prevalent. In
contrast, the Virgin Islands results based on the across-state population methods suggest that
students with 1 hour or less of TV watching have a 2 to 3 point advantage over students reporting 6
or more hours of TV watching. Apparently, the estimated difference has been shrunk dramatically
toward the national difference.

SUMMARY AND CONCLUSIONS 

For both practical and theoretical reasons, the study reported here examined the adequacy
of alternative procedures for estimating population models for use in NAEP's multiple imputations
procedures. From a statistical viewpoint, the alternative procedures can be considered as methods
for obtaining estimates of the coefficients of the model subject to sets of linear constraints. The
restricted estimators, though biased, may have superior properties from the point of view of sampling
variability or MSE. The least restrictive procedures (wwt and wwr) allow for separate estimates of
Γ and impose separate sets of linear constraints for each jurisdiction. Slightly more restrictive
procedures (awt and awr) allow for separate estimates of Γ for each jurisdiction but impose a
common set of constraints on all such estimates. The most restrictive procedures (aat and aar)
require a single estimate of Γ and a common set of constraints. The methods of choosing principal
components (percent-of-trace based and R²-based) can be viewed as different empirical procedures
for identifying the constraints to be imposed.

From a practical viewpoint, the more restricted alternatives can be viewed as ways of
attempting to reduce the work load and computing requirements associated with the TSA, hence
shortening reporting deadlines and reducing costs. Using the 1990 TSA as an example, the wwt and
wwr methods involve 40 sets of analyses to determine principal components and 40 sets of analyses
to carry out model estimation. The awt and awr methods still require 40 sets of model estimation
analyses, but the number of principal component analyses is reduced to 1. The aat and aar methods
require only a single set of principal component analyses and a single set of model estimation runs.
It should be clear from a practical standpoint why, if acceptable results could be obtained, the
restricted procedures would be attractive.

In our opinion, the analyses reported here indicate that the methods which estimate a single
population model do not provide acceptable results. Even for highly aggregated quantities such as
the mean and standard deviation of the composite population of the eight states studied here, the
aat and aar methods did not produce adequate results. When compared to the criterion procedure,
the aat and aar methods showed substantial differences in estimating the means and standard
deviations for several of the eight jurisdictions studied here. For each of the four empirical
contrasts examined (male-female, white-Hispanic, high ability-low ability, and 1 hour of TV-6 hours
of TV), results from the single-population-model methods differed noticeably from the results
obtained from the criterion method for several of the jurisdictions. In particular, contrasts that
were markedly different in magnitude from those observed for the nation (e.g., ability group
differences in Hawaii) were often badly underestimated. These results suggest that the relationship
between background variables and proficiencies is sufficiently different across jurisdictions
to necessitate the use of separate population models for each TSA participant.

Differences in performance among the remaining alternatives are less distinct. However, on
balance, one would have to conclude from the results presented here that the wwt method, the
method used operationally for the 1990 TSA, produced the most satisfactory results. Across all of
the state mean and contrast comparisons carried out, results from the wwt method differed by less
than a standard error from those obtained using the criterion method. In most cases, the
differences were considerably less. The single instance in which results for the wwt method differed
by more than a standard error from those obtained using the criterion method occurred for the
standard deviation of Texas on the DASP scale (1.04 standard error units). Consequently, current
plans are to continue to use the wwt method for analysis of the 1992 TSA.

While the remaining separate-model methods did not perform poorly, their results were
clearly less satisfactory than those of the wwt method. Differences in overall state means and estimates of
empirical contrasts were generally small for the wwr method. For only two of the contrasts did the
differences from the criterion method exceed one standard error. In both cases, this occurred on
the DASP scale. However, in estimating state standard deviations, the wwr method performed
noticeably less well than the wwt method. For the former, five of the differences from the criterion
method estimates exceeded a standard error in magnitude. The corresponding number for the latter
was only 1. The awt and awr methods did not often result in large differences from the criterion method.
However, a few substantial differences in overall state and subgroup means were found, though
almost all appeared to be limited to the awt method applied to the data from the Virgin Islands.
For the awr method, state standard deviations were misestimated somewhat for Hawaii and North
Dakota.

Although the results reported here tend to support the use of the less restricted models in
general, and the wwt method in particular, a definitive statement on the various methods would be
somewhat premature. The analyses reported on here have focused on comparing point estimates
of state means, state standard deviations, and selected empirical contrasts. Such comparisons are
extremely important from the point of view of identifying unacceptable biases that might result from
the various procedures. In our judgement, there is sufficient evidence of unwanted biases in the
single population models to warrant their exclusion from future consideration. However, results are
less clear cut for the various procedures which estimate separate population models for each state
and, before drawing definitive conclusions, it would probably be wise to examine these alternatives
from the point of view of interval estimates, sampling variances, and MSE. Future studies will
extend this work along those lines.

In future research, we also intend to expand the list of alternative procedures somewhat.
For example, one procedure being given consideration involves combining contrasts and principal
components in a single model. The method would proceed as follows. First, a relatively small set
of key variables would be identified. Most likely, this set of variables would consist of the major
NAEP reporting variables (gender, race/ethnicity, type of community, parental education, # of
reading materials in the home, etc.). Contrasts would be defined for these variables. The
remaining larger set of background variables would also be expressed as contrasts. The second set
would be residualized for the first set and then subjected to a principal components analysis. The
final set of variables appearing in the population model would consist of the first set of contrasts
and some (hopefully) small number of principal components from the remaining contrasts.
Preliminary analyses with this approach are encouraging (Nelson, 1992). However, the necessary
evaluative research will most likely not be completed in time to impact the current assessment's
results.
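
The residualize-then-extract step of the proposed method can be sketched as follows. This is an illustration only, assuming NumPy and simulated data; the function name `residualize_then_pca` is hypothetical, ordinary least squares is assumed for the residualization, and the components are taken from the covariance of the residuals (a correlation-based variant would standardize the residuals first).

```python
import numpy as np

def residualize_then_pca(key, rest, n_components):
    """Regress each column of `rest` on `key` (with an intercept), then
    return the leading principal component scores of the residuals."""
    design = np.column_stack([np.ones(len(key)), key])
    coef, *_ = np.linalg.lstsq(design, rest, rcond=None)
    resid = rest - design @ coef
    # The residuals have mean zero, so their SVD gives the principal components.
    u, s, vt = np.linalg.svd(resid, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

# Simulated stand-ins for the two sets of contrasts.
rng = np.random.default_rng(2)
key = rng.normal(size=(200, 2))                    # key reporting contrasts
rest = key @ rng.normal(size=(2, 5)) + rng.normal(size=(200, 5))
pc_scores = residualize_then_pca(key, rest, n_components=2)

# Columns that would enter the population model: key contrasts plus components.
model_matrix = np.column_stack([key, pc_scores])
```

Because the components are built from residuals, they are uncorrelated with the key contrasts, so the key contrasts enter the model directly rather than only through the components.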

Before concluding, there are several additional points that should be noted regarding the
quantification of uncertainty in NAEP results. The standard errors used in this report reflect two
sources of error, a component due to sampling and a component due to imputing rather than
observing θ. This latter component is estimated by generating multiple plausible values for each
examinee (NAEP uses 5), treating these as making up five full data sets, and quantifying the
between-set variance of results. Such an approach follows Rubin's (1987) suggestions for
incorporating imputation error. However, in NAEP's computing machinery prior to 1992
(MGROUP), each set of plausible values is selected from a predictive distribution calculated using
a fixed set of item parameters and a fixed estimate of Γ. Thus, the uncertainty associated with
these estimates is not reflected in the imputations component of variance. As a result, NAEP's
reported standard errors are somewhat optimistic.
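
The way the two components are combined can be illustrated with Rubin's standard combining rule for multiply imputed data: total variance equals the average sampling variance plus (1 + 1/M) times the between-set variance of the M estimates. The sketch below applies that general rule; the function name and the numbers are invented, and the details of how NAEP computes the sampling component are not reproduced here.

```python
import numpy as np

def combine_imputations(estimates, sampling_variances):
    """Combine M per-data-set estimates and their sampling variances into a
    point estimate and a standard error using Rubin's combining rule."""
    estimates = np.asarray(estimates, dtype=float)
    m = len(estimates)
    point = estimates.mean()
    within = np.mean(sampling_variances)      # sampling component
    between = estimates.var(ddof=1)           # imputation component
    total = within + (1.0 + 1.0 / m) * between
    return point, np.sqrt(total)

# Invented subgroup means from five sets of plausible values,
# each with an invented sampling variance of 1.0.
point, se = combine_imputations([250.0, 251.0, 249.0, 252.0, 248.0],
                                [1.0, 1.0, 1.0, 1.0, 1.0])
```

If every set of plausible values is drawn with the same fixed parameter estimates, the between-set term captures only part of the imputation uncertainty, which is the sense in which the resulting standard errors are optimistic.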

A more satisfactory approach, one consistent with Rubin's suggestions, would be to: 1) take
5 random draws from the posterior distribution of item parameters, 2) take 5 random draws from
the posterior distribution of Γ, and 3) produce five sets of plausible values, each using a different
estimate of item parameters and Γ. Such a procedure would provide more realistic estimates of
the component of variability due to imputation. Plans for incorporating uncertainty about Γ are
already in progress. Neal Thomas at ETS is producing enhanced versions of the MGROUP
program that will accomplish this, and plans are that standard errors for the 1992 NAEP results will
reflect the uncertainty involved in using estimates of Γ. Work in progress by Mislevy, Sheehan, and
Wingersky (1991) is directed toward incorporating information about the uncertainty arising from
using item parameter estimates.

An additional source of uncertainty in NAEP results, one demonstrated here, is their
sensitivity to models of nonresponse. As noted earlier, the plausible values approach used in NAEP
is an adaptation of Rubin's (1987) model-based procedures for nonresponse in surveys. Rubin has
talked about the need to display the sensitivity of results to different models of nonresponse.

If in a particular survey without follow-up response, there is no single accepted class
of assumptions about nonresponse, then it is obviously prudent to perform data
analyses under a variety of plausible models for nonresponse. If (a) inferences vary
in important ways as the models change and (b) the data cannot eliminate some models
as inappropriate, then the tautological conclusion must be that the data cannot support
sharp inferences without further specification of the models. (page 17, emphasis added)

In the current study, the various approaches constitute different models. While it appears to be the
case that the data can eliminate some of the models (i.e., those which assume a single population
model for all jurisdictions), a decision among the remaining models is less clear cut and one could
perhaps consider them as competing alternatives.

With the exception of the single population approaches, differences in results among the
alternative procedures did occur. Such differences were, for the most part, small and, in our opinion,
unlikely to affect the types of inferences typically made about NAEP results. Nevertheless, such
differences do underscore the fact that there is an additional source of uncertainty not reflected in
standard errors for NAEP results: the uncertainty associated with choice of nonresponse model.
Perhaps a direction for future work would be to consider how to incorporate this kind of uncertainty
into published standard error estimates.
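
One crude way to display this kind of sensitivity is to report, for each jurisdiction, the range of an estimate across the alternative procedures. The sketch below does this for two rows of Table 3 (NO scale). The helper name `model_spread` is hypothetical, and the range is only an illustrative summary, not a method proposed in this paper.

```python
# Adjusted state means (deviated from the national mean), NO scale, Table 3.
# Column order: crit, wwt, awt, aat, wwr, awr, aar.
procedures = ["crit", "wwt", "awt", "aat", "wwr", "awr", "aar"]
table3_no = {
    "DC": [-28.2, -27.8, -28.3, -28.8, -28.2, -28.4, -28.5],
    "VI": [-38.7, -39.0, -37.3, -38.6, -39.0, -38.1, -38.8],
}


def model_spread(values):
    """Range of an estimate across procedures (largest minus smallest)."""
    return max(values) - min(values)


for state, values in table3_no.items():
    lowest = procedures[values.index(min(values))]
    highest = procedures[values.index(max(values))]
    print(state, round(model_spread(values), 1), lowest, highest)
```

A spread of this kind is in scale-score units, so it can be compared directly with a published standard error to judge whether model choice matters for a given comparison.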






References 



Belinfante, A. & Coxe, K. (1986). Principal components regression - Selection rules and application.
Proc. Bus. & Econ. Sec., Amer. Stat. Assoc., 429-431.

Coxe, K. (1982). Multicollinearity, principal components regression, & selection rules for
these components. Proc. Bus. & Econ. Sec., Amer. Stat. Assoc., 449-453.

Gunst, R.F. & Mason, R.L. (1977). Biased estimation in regression: An evaluation using
mean squared error. Journal of the American Statistical Association, 72, 616-628.

Johnson, E.G. & Mislevy, R.J. (1991). Theoretical background and philosophy of NAEP
scaling procedures. In S.L. Koffler, The technical report of NAEP's 1990 Trial State
Assessment Program (No. 21-ST01). Wash. DC: National Center for Education
Statistics.

Johnson, E.G. & Rust, K.F. (1992). Population inferences and variance estimation for NAEP data.
Journal of Educational Statistics (in press).

Jolliffe, I.T. (1986). Principal component analysis. New York: Springer-Verlag.

Koffler, S.L. (1991). The technical report of NAEP's 1990 Trial State Assessment Program
(No. 21-ST01). Wash. DC: National Center for Education Statistics.

Kerlinger, F.N. & Pedhazur, E.J. (1973). Multiple regression in behavioral research.
New York: Holt, Rinehart, and Winston.

Lott, W.F. (1973). The optimal set of principal component restrictions on a least-squares
regression. Communications in Statistics, 2, 449-464.

Mazzeo, J. (1991). Data analysis and scaling. In S.L. Koffler, The technical report of NAEP's
1990 Trial State Assessment Program (No. 21-ST01). Wash. DC: National Center for
Education Statistics.

Messick, S., Beaton, A.E., & Lord, F. (1983). A new design for a new era. NAEP report 83-1.
Princeton, NJ: National Assessment of Educational Progress.






Mislevy, R.J. (1985). Estimation of latent group effects. Journal of the American Statistical
Association, 80, 993-997.

Mislevy, R.J. (1991). Randomization-based inference about latent variables from complex samples.
Psychometrika, 56, 177-196.

Mislevy, R.J., Beaton, A.E., Kaplan, B. & Sheehan, K.M. (1992). Estimating population
characteristics from sparse matrix samples of item responses. Journal of Educational
Measurement (in press).

Mislevy, R.J. & Bock, R.D. (1982). BILOG: Item analysis and test scoring with binary logistic
models [Computer program]. Mooresville, IN: Scientific Software.

Mislevy, R.J., Johnson, E.G. & Muraki, E. (1992). Scaling procedures in the National Assessment
of Educational Progress. Journal of Educational Statistics (in press).

Mislevy, R.J., Sheehan, K.M., & Wingersky, M. (1991). How to equate tests with little or no data.
Unpublished manuscript.

Nelson, J. & Bowker, D. (1992, April). Using principal components in conditioning NAEP
cross-sectional scales. Paper presented at the annual meeting of the American
Educational Research Association.

Rubin, D.B. (1987). Multiple imputation for nonresponse in surveys. New York: Wiley.

Sheehan, K.M. (1985). M-GROUP: Estimation of group effects in multivariate models
[Computer program]. Princeton, NJ: Educational Testing Service.






Table 1   Sample sizes and number of principal components included
          in the population model for each state for the
          criterion, wwr and wwt methods

           N    crit     wwt     wwr
CA      2424     126      90      90
DC      2135     119      87      84
FL      2534     125      91      81
HI      2551     123      88      88
NJ      2710     122      89      81
ND      2485     117      86     100
TX      2542     124      90      84
VI      1326     115      82      78






Table 2 - Means and standard deviations for the aggregate
          population of the eight states estimated by each
          alternative procedure

NO Scale:

          crit    wwt    awt    aat    wwr    awr    aar
mean     257.8  257.8  258.0  257.8  257.8  257.9  257.9
sd        38.1   38.1   37.9   37.9   37.9   38.1   37.7

DASP Scale:

          crit    wwt    awt    aat    wwr    awr    aar
mean     247.4  247.6  247.1  248.7  247.6  247.5  249.0
sd        49.6   49.6   50.1   46.8   49.3   49.6   46.2






Table 3 - Adjusted state means deviated from national mean

NO Scale (National mean = 266):

        crit    wwt    awt    aat    wwr    awr    aar
ND      20.1   19.8   19.9   20.0   20.1   20.2   20.4
NJ       7.3    7.5    7.1    7.4    7.4    7.3    7.2
TX      -4.1   -4.1   -4.3   -4.1   -4.0   -4.2   -4.2
FL      -5.9   -5.8   -6.4   -6.1   -5.9   -6.1   -6.0
CA      -6.5   -6.5   -6.8   -6.4   -6.5   -6.7   -6.5
HI     -10.0   -9.9  -10.0   -9.3   -9.8  -10.0   -9.5
DC     -28.2  -27.8  -28.3  -28.8  -28.2  -28.4  -28.5
VI     -38.7  -39.0  -37.3  -38.6  -39.0  -38.1  -38.8

DASP Scale (National mean = 262):

        crit    wwt    awt    aat    wwr    awr    aar
ND      23.4   23.6   23.1   24.3   23.5   23.7   24.8
NJ       8.0    7.8    8.3    7.3    8.0    7.7    7.0
TX      -5.5   -5.8   -5.5   -5.8   -5.8   -5.6   -6.3
FL      -7.2   -7.5   -7.3   -7.9   -7.2   -7.3   -7.3
CA      -7.8   -7.8   -6.9   -7.9   -7.7   -7.5   -7.9
HI     -19.3  -19.8  -19.4  -17.4  -19.7  -19.6  -17.7
DC     -41.0  -40.6  -40.3  -42.2  -40.8  -40.7  -42.3
VI     -67.3  -66.5  -68.7  -67.1  -66.9  -67.3  -66.9






Table 4   Adjusted state means -- Deviation from criterion results
          in se(criterion) units

NO Scale:

         wwt    awt    aat    wwr    awr    aar
ND     -0.17  -0.15  -0.08  -0.01   0.05   0.18
NJ      0.13  -0.16   0.10   0.10  -0.04  -0.05
TX     -0.03  -0.14  -0.03   0.06  -0.06  -0.07
FL      0.05  -0.40  -0.20  -0.03  -0.14  -0.08
CA     -0.03  -0.21   0.04  -0.03  -0.13  -0.01
HI      0.07   0.00   0.83   0.27   0.02   0.51
DC      0.47  -0.13  -0.81  -0.05  -0.26  -0.42
VI     -0.52   2.40   0.18  -0.49   0.96  -0.23

DASP Scale:

         wwt    awt    aat    wwr    awr    aar
ND      0.11  -0.17   0.54   0.05   0.15   0.81
NJ     -0.17   0.24  -0.47  -0.01  -0.25  -0.75
TX     -0.20  -0.02  -0.18  -0.18  -0.09  -0.44
FL     -0.23  -0.08  -0.49  -0.02  -0.10  -0.08
CA     -0.02   0.46  -0.07   0.04   0.13  -0.03
HI     -0.45  -0.11   1.93  -0.41  -0.25   1.56
DC      0.48   0.76  -1.32   0.27   0.39  -1.48
VI      0.90  -1.51   0.20   0.40   0.05   0.42
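
The entries of Table 4 express each alternative estimate's departure from the criterion estimate in units of the criterion standard error. A minimal sketch of that arithmetic follows. The function names are hypothetical, and because the criterion standard errors are not listed in these tables, the example backs one out from the rounded values in Tables 3 and 4 (Virgin Islands, NO scale, awt); it is therefore only approximate.

```python
def deviation_in_se_units(alternative, criterion, se_criterion):
    """(alternative estimate - criterion estimate) / se(criterion)."""
    if se_criterion <= 0:
        raise ValueError("standard error must be positive")
    return (alternative - criterion) / se_criterion


def implied_se(alternative, criterion, reported_deviation):
    """Standard error implied by a reported deviation (approximate,
    because the tabled estimates are rounded to one decimal place)."""
    if reported_deviation == 0:
        raise ValueError("cannot back out a standard error from a zero deviation")
    return (alternative - criterion) / reported_deviation


# Virgin Islands, NO scale: criterion -38.7, awt -37.3 (Table 3); deviation 2.40 (Table 4).
se_vi = implied_se(-37.3, -38.7, 2.40)
print(round(se_vi, 2))
print(round(deviation_in_se_units(-37.3, -38.7, se_vi), 2))
```

Reading the table this way makes it easier to see that a deviation above 2 corresponds to a shift of more than two criterion standard errors, even when the raw difference is only about a scale-score point.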



Table 5   Adjusted state standard deviations relative to
          results from the criterion procedure

NO Scale:

        crit    wwt    awt    aat    wwr    awr    aar
CA      37.1   1.01   1.02   0.99   1.02   0.99   0.99
DC      31.9   1.00   1.01   1.05   0.97   0.98   1.02
FL      34.7   0.99   1.00   1.00   1.00   0.99   1.01
HI      38.2   1.00   1.01   0.93   1.02   1.03   0.95
NJ      35.2   0.99   1.00   1.04   1.00   1.01   1.04
ND      30.2   1.01   0.98   1.00   0.97   0.99   1.02
TX      34.0   0.99   1.00   0.98   1.01   1.00   0.99
VI      29.3   1.01   1.01   1.01   0.98   1.01   0.99

DASP Scale:

        crit    wwt    awt    aat    wwr    awr    aar
CA      43.9   0.99   0.99   1.00   1.00   1.00   1.01
DC      42.3   1.00   1.00   1.00   0.99   1.00   1.01
FL      42.5   1.00   1.00   1.04   1.02   1.01   1.04
HI      48.1   1.02   1.00   0.89   1.02   1.01   0.91
NJ      40.8   1.00   0.99   1.13   1.01   1.00   1.11
ND      33.2   0.99   0.97   1.05   0.98   0.93   1.05
TX      41.7   1.02   1.00   1.02   1.01   1.01   1.01
VI      41.7   1.00   1.01   0.89   0.97   1.01   0.88



Table 6   Adjusted state standard deviations - differences from
          criterion method results in se(criterion) units

NO Scale:

         wwt    awt    aat    wwr    awr    aar
CA      0.48   1.06  -0.42   0.74  -0.56  -0.54
DC     -0.17   0.53   2.10  -1.16  -0.88   0.99
FL     -0.30   0.14   0.08  -0.18  -0.45   0.34
HI      0.22   0.55  -3.25   1.00   1.24  -2.60
NJ     -0.43  -0.12   2.07   0.09   0.60   1.94
ND      0.32  -0.61   0.05  -1.12  -0.18   0.60
TX     -0.40  -0.18  -0.96   0.46   0.27  -0.80
VI      0.29   0.60   0.38  -0.75   0.40  -0.65

DASP Scale:

         wwt    awt    aat    wwr    awr    aar
CA     -0.37  -0.50  -0.20   0.20  -0.10   0.55
DC      0.04  -0.10   0.03  -0.38  -0.12   0.41
FL      0.04   0.14   2.03   1.00   0.75   1.91
HI      0.85  -0.16  -6.03   0.96   0.46  -5.05
NJ      0.03  -0.40   4.60   0.33   0.17   3.71
ND     -0.24  -0.80   1.42  -0.69  -2.05   1.52
TX      1.04  -0.18   0.95   0.29   0.31   0.61
VI     -0.06   0.57  -4.14  -1.13   0.56  -4.81






Table 7   Male - female differences -- deviated from national mean

NO Scale (National mean = 0):

        crit    wwt    awt    aat    wwr    awr    aar
ND       5.7    5.9    4.8    4.1    4.7    5.7    5.5
TX       4.0    3.4    3.9    3.3      ?    3.7    3.3
NJ       3.5    3.7    3.4    2.8      ?      ?      ?
FL       1.2    0.7    1.4    1.8      ?      ?    1.7
CA       1.0    0.8    0.5    2.2    1.1    0.1      ?
VI       0.6    1.4   -0.2      ?    0.4    0.2      ?
DC      -3.6   -4.3   -4.3   -2.5      ?      ?   -3.3
HI      -8.0   -7.4   -7.3   -5.0   -8.2      ?   -4.3

? = entry illegible in the source.

DASP Scale (National mean = ...):

        crit    wwt    awt    aat    wwr    awr    aar
ND       6.8    7.8    7.6    6.3    6.7    7.9    7.5
VI       5.7    4.3    2.4    4.2    1.9    4.4    1.1
FL       4.5    4.9    4.7    4.7    3.7    4.3    4.0
NJ       4.0    3.8    4.0    4.6    4.0    3.3    4.9
CA       3.0    3.5    2.9    3.6    1.9    3.7    3.9
TX       2.6    1.6    2.2    3.6    2.3    2.3    3.2
DC      -1.9   -2.5   -1.6   -1.6   -1.5   -2.5   -2.3
HI      -5.6   -5.2   -4.1   -2.2   -6.7   -5.3   -2.1


Table 8   M-F difference -- Deviation from criterion method results
          in se(criterion) units

NO Scale:

         wwt    awt    aat    wwr    awr    aar
ND      0.08  -0.38  -0.67  -0.42   0.00  -0.08
TX     -0.29  -0.05  -0.33   0.14  -0.14  -0.33
NJ      0.10  -0.05  -0.35   0.00  -0.25  -0.25
FL     -0.24   0.10   0.29   0.14  -0.33   0.24
CA     -0.09  -0.22   0.52   0.04  -0.39   0.26
VI      0.53  -0.53   0.33  -0.13  -0.27   0.00
DC     -0.50  -0.50   0.79   0.29  -0.43   0.21
HI      0.35   0.41   1.76  -0.18   0.29   2.18

DASP Scale:

         wwt    awt    aat    wwr    awr    aar
ND      0.37   0.30  -0.19  -0.04   0.41   0.26
VI     -0.58  -1.38  -0.63  -1.58  -0.54  -1.92
FL      0.15   0.08   0.08  -0.31  -0.08  -0.19
NJ     -0.09   0.00   0.26   0.00  -0.30   0.39
CA      0.17  -0.03   0.21  -0.38   0.24   0.31
TX     -0.36  -0.14   0.36  -0.11  -0.11   0.21
DC     -0.32   0.16   0.16   0.21  -0.32  -0.21
HI      0.20   0.75   1.7?  -0.55   0.15   1.75



Table 9   White-Hispanic difference deviated from national mean¹

NO Scale (National mean = 25):

        crit    wwt    awt    aat    wwr    awr    aar
DC      41.3   38.4   38.8   36.0   39.6   40.0   33.4
CA       8.0    7.8    8.7    7.8    8.0    7.8    8.5
ND       6.8    5.8    7.3    2.9    7.8    6.4    6.0
NJ       6.7    6.3    7.4    9.6    7.5    8.0   10.4
HI       5.3    5.4    5.6    2.9    5.9    6.5    2.8
TX       1.6    1.5    1.5    2.3    2.2    2.5    2.6
FL      -6.8   -7.0   -7.3   -5.8   -7.4   -6.4   -4.6

DASP Scale (National mean = 23):

        crit    wwt    awt    aat    wwr    awr    aar
DC      52.4   50.6   49.1   42.8   54.2   51.8   44.5
HI      13.7   13.7   12.6    4.4   11.8   10.6    6.6
CA      12.0   10.9   11.4   12.5   10.8   11.7   12.7
NJ      11.8   13.7   11.4   18.8   11.9   13.0   17.9
TX       2.0    2.4    1.7    6.2    2.1    3.5    6.3
ND      -3.9   -7.4   -4.9   -5.1   -0.6   -4.9   -2.1
FL      -9.8   -9.4  -10.9   -4.3   -8.6   -9.1   -3.4

¹Results not reported for Virgin Islands due to insufficient sample sizes for white examinees.



Table 10  White-Hispanic difference -- Deviation from criterion
          method results in se(criterion) units¹

NO Scale:

         wwt    awt    aat    wwr    awr    aar
DC     -0.57  -0.49  -1.04  -0.33  -0.25  -1.55
CA     -0.08   0.27  -0.08   0.00  -0.08   0.19
ND     -0.15   0.07  -0.57   0.15  -0.06  -0.12
NJ     -0.13   0.23   0.97   0.27   0.43   1.23
HI      0.03   0.09  -0.69   0.46   0.34  -0.71
TX     -0.05  -0.05   0.33   0.29   0.43   0.48
FL     -0.07  -0.19   0.37  -0.22   0.15   0.81
VI     -0.53   0.47  -0.24   0.17   0.36   0.08

DASP Scale:

         wwt    awt    aat    wwr    awr    aar
DC     -0.19  -0.35  -1.01   0.19  -0.06  -0.83
HI      0.00  -0.27  -2.32  -0.47  -0.77  -1.78
CA     -0.33  -0.18   0.15  -0.36  -0.09   0.21
NJ      0.46  -0.10   1.71   0.02   0.29   1.49
TX      0.13  -0.10   1.40   0.03   0.50   1.43
ND     -0.51  -0.15  -0.18   0.49  -0.15   0.26
FL      0.13  -0.37   1.83   0.40   0.23   2.13
VI     -0.40  -1.01   0.67   1.28   0.54   1.35

¹Results not reported for Virgin Islands due to insufficient sample sizes for white examinees.



Table 11  High-Low ability difference deviated from national mean

NO Scale (National mean = 49):

        crit    wwt    awt    aat    wwr    awr    aar
HI      19.2   19.5   19.5   12.4   19.6   19.1   12.7
FL      10.6   10.5   10.7   11.1   10.1    9.4   10.8
CA      10.4    9.2   10.2    8.7   10.1    9.6    8.5
NJ       9.1    8.3    8.6   10.7    9.5    8.4   10.2
ND       6.9    5.8    3.9    3.9    5.0    7.8      ?
TX       5.5    5.8    6.0    5.2    3.9    5.6    4.5
DC      -4.3   -5.4   -4.9   -1.3   -6.1      ?      ?
VI     -14.7  -15.8  -14.4  -12.1  -14.8  -14.7  -11.6

? = entry illegible in the source.

DASP Scale (National mean = ...):

        crit    wwt    awt    aat    wwr    awr    aar
HI      30.0   31.3   29.2   16.0   29.1   28.9   16.8
FL      13.7   14.3   14.3   16.9   13.5   12.1   16.4
TX       9.0   10.1    9.7    9.2    9.6    8.0    9.4
CA       7.4    7.6    7.3   11.6   10.7    7.7   10.0
DC       5.7    2.3    3.1    9.0    4.6    3.1    7.5
NJ       5.3    3.8    2.1   12.9    5.4    5.9   12.6
ND      -0.7   -0.6   -1.0    3.2   -3.1    1.9    5.2
VI      -4.0   -2.2   -3.5   -8.0   -3.9   -5.1   -6.7



Table 12  High-Low difference -- Deviation from criterion method
          results in se(criterion) units

NO Scale:

         wwt    awt    aat    wwr    awr    aar
HI      0.15      ?      ?      ?      ?      ?
FL     -0.03   0.03   0.17  -0.17  -0.40   0.07
CA     -0.44  -0.07  -0.63  -0.11  -0.30  -0.70
NJ     -0.22  -0.14   0.44   0.11  -0.19   0.31
ND     -0.26  -0.71  -0.71  -0.45   0.21  -0.17
TX      0.09   0.16  -0.09  -0.50   0.03  -0.31
DC     -0.35  -0.19   0.97  -0.58  -0.81   0.58
VI     -0.32   0.09   0.76  -0.03   0.00   0.91

? = entry illegible in the source.

DASP Scale:

         wwt    awt    aat    wwr    awr    aar
HI      0.48  -0.30  -5.19  -0.33  -0.41  -4.89
FL      0.17   0.17   0.91  -0.06  -0.46   0.77
TX      0.26   0.16   0.05   0.14  -0.23   0.09
CA      0.06  -0.03   1.27   1.00   0.09   0.79
DC     -0.97  -0.74   0.94  -0.31  -0.74   0.51
NJ     -0.31  -0.65   1.55   0.02   0.12   1.49
ND      0.01  -0.04   0.57  -0.35   0.38   0.87
VI      0.40   0.11  -0.89   0.02  -0.24  -0.60



Table 13  TV1 - TV6 difference deviated from national mean

NO Scale (National mean = 24):

        crit    wwt    awt    aat    wwr    awr    aar
NJ       8.9    9.2    9.1   11.2    9.9    9.4   10.3
ND      -3.3   -1.1   -2.6   -1.2   -2.5   -2.6   -2.3
CA      -3.6   -3.7   -2.7   -4.4   -4.1   -3.0   -4.0
FL      -4.5   -4.1   -4.5   -4.6   -4.5   -4.7   -4.3
TX      -9.4   -9.3   -9.0   -8.0   -9.5   -9.9   -8.5
HI      -9.4   -9.0   -9.8   -8.6   -9.8   -9.0   -9.6
DC     -17.7  -18.5  -16.0  -15.5  -17.6  -17.4  -13.9
VI     -27.3  -25.9  -22.8  -23.3  -26.0  -26.4  -22.7

DASP Scale (National mean = ...):

        crit    wwt    awt    aat    wwr    awr    aar
NJ      18.1   18.5   16.5   21.9   18.4   17.0   22.7
ND       2.6    4.4    4.0    7.2    4.6    2.8    4.9
CA       0.0    1.3    1.3    2.8   -0.7    1.5    2.5
TX      -0.5   -0.7    0.3    0.1   -3.7   -2.1   -1.6
HI      -3.1   -3.9   -4.1   -3.8   -5.4   -5.4   -4.5
FL      -3.6   -3.5   -5.4    0.5   -6.5   -3.2   -0.1
DC     -11.1   -8.6   -9.7   -8.2  -11.0   -8.3   -6.2
VI     -34.6  -34.5  -34.9  -20.8  -29.9  -32.1  -22.1

Table 14  TV1 - TV6 difference -- Deviation from criterion method
          results in se(criterion) units

NO Scale:

         wwt    awt    aat    wwr    awr    aar
NJ      0.08   0.05   0.59   0.26   0.13   0.36
ND      0.51   0.16   0.49   0.19   0.16   0.23
CA     -0.03   0.22  -0.20  -0.12   0.15  -0.10
FL      0.12   0.00  -0.03   0.00  -0.06   0.06
TX      0.03   0.11   0.39  -0.03  -0.14   0.25
HI      0.12  -0.12   0.24  -0.12   0.12  -0.06
DC     -0.13   0.27   0.35   0.02   0.05   0.60
VI      0.52   1.67   1.48   0.48   0.33   1.70

DASP Scale:

         wwt    awt    aat    wwr    awr    aar
NJ      0.09  -0.34   0.81   0.06  -0.23   0.98
ND      0.38   0.30   0.98   0.43   0.04   0.49
CA      0.28   0.28   0.60  -0.15   0.32   0.53
TX     -0.04   0.16   0.12  -0.65  -0.33  -0.22
HI     -0.19  -0.24  -0.17  -0.55  -0.55  -0.33
FL      0.02  -0.43   0.98  -0.69   0.10   0.83
DC      0.30   0.17   0.35   0.01   0.34   0.60
VI      0.03  -0.10   4.60   1.57   0.83   4.17



Appendix 1   Number of contrasts, number of 0 eigenvalues, and the
             number of components deleted due to computational
             instabilities

          # of         # of 0       # of unstable
        contrasts    eigenvalues         PCs
CA         166           15              25
DC         166           20              27
FL         166           16              25
HI         166           19              24
NJ         166           16              28
ND         166           21              28
TX         166           27              15
VI         166           26              25
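
The counts in Appendix 1 appear to account for the number of components in the criterion column of Table 1: for each jurisdiction, contrasts minus zero eigenvalues minus unstable components equals the tabled criterion count. The short check below is a sketch of that bookkeeping. The relationship is inferred from the numbers rather than stated in the text, and the function name `retained_components` is hypothetical.

```python
# Appendix 1: state -> (contrasts, zero eigenvalues, unstable PCs deleted).
appendix1 = {
    "CA": (166, 15, 25), "DC": (166, 20, 27), "FL": (166, 16, 25),
    "HI": (166, 19, 24), "NJ": (166, 16, 28), "ND": (166, 21, 28),
    "TX": (166, 27, 15), "VI": (166, 26, 25),
}
# Table 1: number of principal components used by the criterion method.
table1_crit = {"CA": 126, "DC": 119, "FL": 125, "HI": 123,
               "NJ": 122, "ND": 117, "TX": 124, "VI": 115}


def retained_components(contrasts, zero_eigenvalues, unstable):
    """Components left after dropping null and numerically unstable ones."""
    return contrasts - zero_eigenvalues - unstable


mismatches = [state for state, counts in appendix1.items()
              if retained_components(*counts) != table1_crit[state]]
print(mismatches)
```

An empty list of mismatches means the two tables agree under this assumed relationship for all eight jurisdictions.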






Appendix 2   Subgroup sample sizes for each jurisdiction

               CA    DC    FL    HI    NJ    ND    TX    VI
Males        1244  1003  1291  1341  1350  1279  1261   644
Females      1180  1132  1243  1210  1350  1206  1281   682

               CA    DC    FL    HI    NJ    ND    TX    VI
Whites       1091    54  1548   445  1789  2234  1175    20
Hispanics     818   192   398   264   363    70   926   265

               CA    DC    FL    HI    NJ    ND    TX    VI
High Abil     638   287   602   588   639   254   371   162
Low Abil      394   378   489   653   541   179   407   248

               CA    DC    FL    HI    NJ    ND    TX    VI
1-Hr TV       398   157   299   252   332   334   313   229
6-Hr TV       262   716   471   572   349   158   375   351
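
As a simple consistency check, the male and female counts above can be summed and compared with the sample sizes N in Table 1. The sketch below assumes that every sampled student falls into exactly one of the two groups, which the text does not state explicitly. Any discrepancy it reports may reflect a transcription error in one of the tables rather than a feature of the samples.

```python
# Appendix 2 counts by sex, and total sample sizes N from Table 1.
males = {"CA": 1244, "DC": 1003, "FL": 1291, "HI": 1341,
         "NJ": 1350, "ND": 1279, "TX": 1261, "VI": 644}
females = {"CA": 1180, "DC": 1132, "FL": 1243, "HI": 1210,
           "NJ": 1350, "ND": 1206, "TX": 1281, "VI": 682}
table1_n = {"CA": 2424, "DC": 2135, "FL": 2534, "HI": 2551,
            "NJ": 2710, "ND": 2485, "TX": 2542, "VI": 1326}

# Keep only the jurisdictions where the subgroup sum differs from N.
discrepancies = {
    state: (males[state] + females[state], table1_n[state])
    for state in table1_n
    if males[state] + females[state] != table1_n[state]
}
print(discrepancies)
```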




